Improving Brill's PoS Tagger for an Agglutinative Language
Abstract
In this paper, Brill's rule-based PoS tagger is tested and adapted for Hungarian. It is shown that the present system does not achieve as high an accuracy for Hungarian as it does for English (and other Germanic languages) because of the structural differences between these languages. Hungarian, unlike English, has rich morphology, is agglutinative with some inflectional characteristics, and has fairly free word order. The tagger has the greatest difficulty with parts of speech belonging to open classes because of their complicated morphological structure. It is shown that tagging accuracy can be increased from approximately 83% to 97% simply by changing the rule-generating mechanisms, namely the lexical templates in the lexical training module.
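To make the idea of a lexical template concrete, here is a minimal sketch, assuming a single template of the form "if the word ends in suffix x with |x| ≤ n, change its tag". It is not the paper's implementation: the function name `best_suffix_rule`, the toy word list, the tag labels, and the suffix lengths 4 and 5 are all illustrative assumptions. It only suggests why a longer suffix window may help on agglutinative word forms.

```python
from collections import Counter

def best_suffix_rule(words, max_suffix_len, default_tag="N"):
    """Pick the best rule 'if the word ends in suffix, change its tag to tag'.

    words: list of (word, gold_tag); every word is assumed to start out
    tagged with default_tag. A rule's score is errors fixed minus errors
    introduced. Returns (suffix, tag, score), or None if no rule helps.
    """
    fixed = Counter()   # (suffix, tag) -> mistagged words the rule corrects
    broken = Counter()  # suffix -> correctly tagged words the rule would break
    for word, gold in words:
        # Consider every proper suffix up to the template's length limit.
        for n in range(1, min(max_suffix_len, len(word) - 1) + 1):
            suffix = word[-n:]
            if gold == default_tag:
                broken[suffix] += 1
            else:
                fixed[(suffix, gold)] += 1
    best = None
    for (suffix, tag), gain in fixed.items():
        score = gain - broken[suffix]
        if best is None or score > best[2]:
            best = (suffix, tag, score)
    return best

# Hypothetical toy data: three verb forms ending in -hat-ok and two
# plural nouns that share the shorter ending "atok".
toy = [
    ("olvashatok", "V"), ("írhatok", "V"), ("láthatok", "V"),
    ("állatok", "N"), ("feladatok", "N"),
]

short_window = best_suffix_rule(toy, max_suffix_len=4)
long_window = best_suffix_rule(toy, max_suffix_len=5)
```

On this toy list, every suffix of up to four characters is shared by the nouns, so the best short-window rule has a net score of only 1, while the five-character suffix "hatok" separates the verbs cleanly with a score of 3.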
Similar resources
Part-of-Speech (POS) Tagging Revisited
Accurate part-of-speech (POS) tagging of natural language text data can add power to automated information retrieval and extraction. Brill's transformation-based learning (TBL) approach to automated POS tagging was introduced in 1992, combining virtues of rule-based and stochastic methods. Brill's innovative idea was to use machine learning techniques to search through all of rule space for the...
Will the Identification of Reduplicated Multiword Expression (RMWE) Improve the Performance of SVM Based Manipuri POS Tagging?
Reduplicated Multiword Expressions (RMWEs) are abundant in Manipuri, a highly agglutinative Indian language. Part-of-Speech (POS) tagging of Manipuri using a Support Vector Machine (SVM) has been developed and evaluated. The POS tagger has been updated with identified RMWEs as an additional feature. The performance of the SVM-based POS tagger before and after adding RMWEs as a feature has been com...
Brill’s PoS Tagger with Extended Lexical Templates for Hungarian
In this paper Brill’s rule-based PoS tagger is tested and adapted to Hungarian. It is shown that the present system does not obtain as high accuracy for Hungarian as it does for English because of the structural difference between these languages. Hungarian has rich morphology, is agglutinative with inflectional characteristics and has free word order. The tagger has the greatest difficulties w...
A Comparative Study of the Effect of Part-of-Speech Tagging on Parsing in the Automatic Processing of Persian
In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...
Independence and Commitment: Assumptions for Rapid Training and Execution of Rule-based POS Taggers
This paper addresses the rule-based POS tagging method of Brill, and questions the importance of rule interactions to its performance. Adopting two assumptions that serve to exclude rule interactions during tagging and training, we arrive at some variants of Brill's approach that are instances of decision list models. These models allow for both rapid training on large data sets and rapid tagge...